{
 "cells": [
  {
   "cell_type": "code",
   "id": "initial_id",
   "metadata": {
    "collapsed": true,
    "ExecuteTime": {
     "end_time": "2024-09-01T06:52:25.289810Z",
     "start_time": "2024-09-01T06:52:24.962993Z"
    }
   },
   "source": [
    "import os\n",
    "import json\n",
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "from models import KnowledgeGraphLLM"
   ],
   "outputs": [],
   "execution_count": 35
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-01T06:43:31.139025Z",
     "start_time": "2024-09-01T06:43:31.133711Z"
    }
   },
   "cell_type": "code",
   "source": [
    "# load the sample abstracts (keyed by concept name) and the relation definitions\n",
    "with open('data/test/concept_abstracts_sample.json', 'r') as f:\n",
    "    data = json.load(f)\n",
    "with open('data/relation_types.json', 'r') as f:\n",
    "    relation_def = json.load(f)\n",
    "relation_types = list(relation_def.keys())"
   ],
   "id": "410c61e96651269",
   "outputs": [],
   "execution_count": 5
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-01T06:55:54.583270Z",
     "start_time": "2024-09-01T06:55:54.574458Z"
    }
   },
   "cell_type": "code",
   "source": [
    "# initialize the prompt template\n",
    "with open(\"prompts/prompt_step_01.txt\") as f:\n",
    "    prompt_template_txt = f.read()\n",
    "prompt_template = ChatPromptTemplate.from_messages([\n",
    "    (\"system\", \"You are a knowledge graph builder.\"),\n",
    "    (\"user\", prompt_template_txt)\n",
    "])\n",
    "\n",
    "print('Prompt template', prompt_template)"
   ],
   "id": "623fe041dbf5b05a",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Prompt template input_variables=['abstracts', 'concepts', 'relation_definitions'] messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], template='You are a knowledge graph builder.')), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['abstracts', 'concepts', 'relation_definitions'], template='### Instruction:\\nYou are a domain expert in computer science, natural language processing, and now you are building a knowledge graph in this domain. Given a context (### Content), and a query concept (### Concept), do the following:\\n1. Extract the query concept and some in-domain concepts from the context, these concepts should be fine-grained: could be introduced by a lecture slide page, or a whole lecture, or possibly to have a Wikipedia page.\\n2. Determine the relationships between the query concept and the extracted concepts from Step 1, in a triplet format: (<head concept>, <relation>, <tail concept>). The relationship should be functional, aiding learners in understanding the knowledge. The query concept can be the head concept or tail concept. We define 7 types of the relations:\\n    {relation_definitions}\\n3. Some relation types are strictly directional. For example, \"A tool is evaluated for B\" indicates (A, Evaluate-for, B), NOT (B, Evaluate-for, A). Among the seven relation types, only \"a) Compare\" and \"c) Conjunction\" are not direction-sensitive.\\n4. You can also extract triplets from the extracted concepts, and the query concept may not be necessary in the triplets.\\n5. Your answer should ONLY contain a list of triplets, each triplet is in this format: (concept, relation, concept). For example: \"(concept, relation, concept)(concept, relation, concept).\" No numbering and other explanations are needed.\\n6. If ### Content is empty, output None.\\n\\n\\n### Content: {abstracts}\\n\\n### Concept: {concepts}'))]\n"
     ]
    }
   ],
   "execution_count": 43
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-01T06:50:30.349486Z",
     "start_time": "2024-09-01T06:50:30.339808Z"
    }
   },
   "cell_type": "code",
   "source": [
    "concept_name = \"spelling correction\"\n",
    "print('Sample concept:', concept_name)\n",
    "abstracts = ' '.join(data[concept_name]['abstracts'])\n",
    "print('Used abstracts:', abstracts[:500])\n",
    "\n",
    "# instantiate the prompt template\n",
    "prompt = prompt_template.invoke(\n",
    "    {\"abstracts\": abstracts[:10000],\n",
    "     \"concepts\": [concept_name],\n",
    "     \"relation_definitions\": '\\n'.join(\n",
    "         [f\"{rel_type}: {rel_data['description']}\" for rel_type, rel_data in\n",
    "          relation_def.items()])})"
   ],
   "id": "37752d971f09383a",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sample concept: spelling correction\n",
      "Used abstracts: We propose a novel approach to OCR post-correction that exploits repeated texts in large corpora both as a source of noisy target outputs for unsupervised training and as a source of evidence when decoding. A sequence-to-sequence model with attention is applied for single-input correction, and a new decoder with multi-input attention averaging is developed to search for consensus among multiple sequences. We design two ways of training the correction model without human annotation, either traini\n"
     ]
    }
   ],
   "execution_count": 34
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-01T06:56:08.182229Z",
     "start_time": "2024-09-01T06:56:08.178620Z"
    }
   },
   "cell_type": "code",
   "source": "print('Prompt:', prompt)",
   "id": "f7c8352db7301dcd",
   "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "Prompt: messages=[SystemMessage(content='You are a knowledge graph builder.'), HumanMessage(content='### Instruction:\\nYou are a domain expert in computer science, natural language processing, and now you are building a knowledge graph in this domain. Given a context (### Content), and a query concept (### Concept), do the following:\\n1. Extract the query concept and some in-domain concepts from the context, these concepts should be fine-grained: could be introduced by a lecture slide page, or a whole lecture, or possibly to have a Wikipedia page.\\n2. Determine the relationships between the query concept and the extracted concepts from Step 1, in a triplet format: (<head concept>, <relation>, <tail concept>). The relationship should be functional, aiding learners in understanding the knowledge. The query concept can be the head concept or tail concept. We define 7 types of the relations:\\n    Compare: Represents a relationship between two or more entities where a comparison is being made. For example, \"A is larger than B\" or \"X is more efficient than Y.\"\\nPart-of: Denotes a relationship where one entity is a constituent or component of another. For instance, \"Wheel is a part of a Car.\"\\nConjunction: Indicates a logical or semantic relationship where two or more entities are connected to form a group or composite idea. For example, \"Salt and Pepper.\"\\nEvaluate-for: Represents an evaluative relationship where one entity is assessed in the context of another. For example, \"A tool is evaluated for its effectiveness.\"\\nIs-a-Prerequisite-of: This dual-purpose relationship implies that one entity is either a characteristic of another or a required precursor for another. For instance, \"The ability to code is a prerequisite of software development.\"\\nUsed-for: Denotes a functional relationship where one entity is utilized in accomplishing or facilitating the other. For example, \"A hammer is used for driving nails.\"\\nHyponym-Of: Establishes a hierarchical relationship where one entity is a more specific version or subtype of another. For instance, \"A Sedan is a hyponym of a Car.\"\\n3. Some relation types are strictly directional. For example, \"A tool is evaluated for B\" indicates (A, Evaluate-for, B), NOT (B, Evaluate-for, A). Among the seven relation types, only \"a) Compare\" and \"c) Conjunction\" are not direction-sensitive.\\n4. You can also extract triplets from the extracted concepts, and the query concept may not be necessary in the triplets.\\n5. Your answer should ONLY contain a list of triplets, each triplet is in this format: (concept, relation, concept). For example: \"(concept, relation, concept)(concept, relation, concept).\" No numbering and other explanations are needed.\\n6. If ### Content is empty, output None.\\n\\n\\n### Content: We propose a novel approach to OCR post-correction that exploits repeated texts in large corpora both as a source of noisy target outputs for unsupervised training and as a source of evidence when decoding. A sequence-to-sequence model with attention is applied for single-input correction, and a new decoder with multi-input attention averaging is developed to search for consensus among multiple sequences. We design two ways of training the correction model without human annotation, either training to match noisily observed textual variants or bootstrapping from a uniform error model. On two corpora of historical newspapers and books, we show that these unsupervised techniques cut the character and word error rates nearly in half on single inputs and, with the addition of multi-input decoding, can rival supervised methods. The quality of a counseling intervention relies highly on the active collaboration between clients and counselors. In this paper, we explore several linguistic aspects of the collaboration process occurring during counseling conversations. Specifically, we address the differences between high-quality and low-quality counseling. Our approach examines participants’ turn-by-turn interaction, their linguistic alignment, the sentiment expressed by speakers during the conversation, as well as the different topics being discussed. Our results suggest important language differences in low- and high-quality counseling, which we further use to derive linguistic features able to capture the differences between the two groups. These features are then used to build automatic classifiers that can predict counseling quality with accuracies of up to 88%. The Minecraft Collaborative Building Task is a two-player game in which an Architect (A) instructs a Builder (B) to construct a target structure in a simulated Blocks World Environment. We define the subtask of predicting correct action sequences (block placements and removals) in a given game context, and show that capturing B’s past actions as well as B’s perspective leads to a significant improvement in performance on this challenging language understanding problem.\\n\\n### Concept: [\\'spelling correction\\']')]\n"
     ]
    }
   ],
   "execution_count": 44
  },
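  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "# A minimal sketch for reading the prompt more easily than the raw repr printed above.\n",
    "# Assumes `prompt` is the ChatPromptValue returned by prompt_template.invoke(...);\n",
    "# the 300-character preview length is an arbitrary choice, not part of the pipeline.\n",
    "for message in prompt.to_messages():\n",
    "    print(f'[{message.type}] {message.content[:300]}')"
   ],
   "id": "5d1e7a90c3b24f68",
   "outputs": [],
   "execution_count": null
  },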
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-01T06:52:59.073922Z",
     "start_time": "2024-09-01T06:52:55.900977Z"
    }
   },
   "cell_type": "code",
   "source": [
    "# read the API key from the local config file\n",
    "with open('private_config.json') as f:\n",
    "    os.environ[\"OPENAI_API_KEY\"] = json.load(f)['OPENAI_API_KEY']\n",
    "# init the model\n",
    "model = KnowledgeGraphLLM(model_name=\"nlp_gpt-3.5-turbo\",\n",
    "                          max_tokens=400)\n",
    "\n",
    "# query the model\n",
    "response = model.invoke(prompt)"
   ],
   "id": "82dfc02974d665ef",
   "outputs": [],
   "execution_count": 36
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-01T06:55:30.476100Z",
     "start_time": "2024-09-01T06:55:30.472966Z"
    }
   },
   "cell_type": "code",
   "source": [
    "print('Extracted triples:')\n",
    "for triple in json.loads(response):\n",
    "    print(', '.join([triple['s'], triple['p'], triple['o']]))"
   ],
   "id": "cb4b7ebe3a92879b",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Extracted triples:\n",
      "OCR post-correction, Compare, spelling correction\n",
      "OCR post-correction, Evaluate-for, training the correction model without human annotation\n",
      "OCR post-correction, Evaluate-for, bootstrapping from a uniform error model\n",
      "OCR post-correction, Used-for, cutting the character and word error rates\n",
      "counseling intervention, Is-a-Prerequisite-of, active collaboration between clients and counselors\n",
      "counseling intervention, Compare, collaboration process during counseling conversations\n",
      "counseling intervention, Compare, differences between high-quality and low-quality counseling\n",
      "counseling intervention, Evaluate-for, examining participants’ turn-by-turn interaction\n",
      "counseling intervention, Evaluate-for, deriving linguistic features to capture differences\n",
      "counseling intervention, Used-for, building automatic classifiers to predict counseling quality\n",
      "Minecraft Collaborative Building Task, Compare, two-player game\n",
      "Minecraft Collaborative Building Task, Compare, predicting correct action sequences\n",
      "Minecraft Collaborative Building Task, Compare, capturing Builder's past actions\n",
      "Minecraft Collaborative Building Task, Compare, Builder's perspective\n"
     ]
    }
   ],
   "execution_count": 42
  },
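  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "# A minimal sketch (not part of the original pipeline): split the extracted triples\n",
    "# by whether their relation is one of the types loaded from data/relation_types.json.\n",
    "# Assumes `response` is a JSON list of dicts with 's', 'p', 'o' keys, as parsed in the\n",
    "# cell above; `filter_triples` is a hypothetical helper introduced here for illustration.\n",
    "def filter_triples(triples, allowed_relations):\n",
    "    allowed = set(allowed_relations)\n",
    "    kept, dropped = [], []\n",
    "    for triple in triples:\n",
    "        # a triple is kept only if its predicate is a known relation type\n",
    "        (kept if triple.get('p') in allowed else dropped).append(triple)\n",
    "    return kept, dropped\n",
    "\n",
    "valid_triples, invalid_triples = filter_triples(json.loads(response), relation_types)\n",
    "print(f'{len(valid_triples)} valid, {len(invalid_triples)} with an unknown relation type')"
   ],
   "id": "9a4c2e6f1b8d3057",
   "outputs": [],
   "execution_count": null
  },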
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": "",
   "id": "ce7dddcc60b9182f"
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
